class: center, middle, inverse, title-slide

.title[
# Part 1 - R Basics
]
.author[
### Graham Bearden
]
.institute[
### University of Washington
]
.date[
### October 3, 2024
]

---
class: inverse, middle, center

# Getting Started

---

# Set Up

- Set up your machine to run R
  - Install R [on a Mac with an Intel chip](https://cran.rstudio.com/bin/macosx/big-sur-x86_64/base/R-4.4.1-x86_64.pkg), [on a Mac with an Apple silicon (M1/M2/M3) chip](https://cran.rstudio.com/bin/macosx/big-sur-arm64/base/R-4.4.1-arm64.pkg), or [on a PC](https://cran.rstudio.com/bin/windows/base/R-4.4.1-win.exe)
  - Install RStudio [on a Mac](https://download1.rstudio.org/electron/macos/RStudio-2024.04.2-764.dmg) or [on a PC](https://download1.rstudio.org/electron/windows/RStudio-2024.04.2-764.exe)
- Create a [ChatGPT](https://chat.openai.com/auth/login) or [Perplexity](https://www.perplexity.ai/) account

---

# R and the R Programmer

- Typical uses for R
  - Data preparation (cleaning, reshaping, creation)
  - Data analysis
  - Data visualization
  - Automation
  - Productizing
- Mental models and analysis tools
  - Excel/SPSS/Stata --> computational programming tools like R
- What all R programmers do
  - Google for answers
  - Borrow code
  - Ask friends and pros for help
  - Ask ChatGPT, Perplexity, or another AI assistant
- Useful references (see Reference Materials in the [syllabus](https://gbearden.github.io/r_course_evans_school/) for more)
  - `help()`
  - `??`
  - [R Bloggers](https://www.r-bloggers.com/)
  - [stackoverflow](https://stackoverflow.com/questions/tagged/r)

---

# Quick Tour of RStudio

- Script vs. notebook
  - Scripts are good for code development, running code in production, version control, and multi-file analyses
  - Notebooks are good for sharing analysis, documentation, and mixed code in a single file
- Run code from the notebook and the console
  - Console vs. notebooks/scripts
- Make sure you can install packages
  - Command line/Console
  - RStudio GUI

--

```r
install.packages('tidyverse')
```

---

# Quick Tour of RStudio

- Script vs. notebook
  - Scripts are good for code development, running code in production, version control, and multi-file analyses
  - Notebooks are good for sharing analysis, documentation, and mixed code in a single file
- Run code from the notebook and the console
  - Console vs. notebooks/scripts
- Make sure you can install packages
  - Command line/Console
  - RStudio GUI

<img src="./figures/r_studio_install_packages.png" width="60%" />

---
class: inverse, middle, center

# Libraries, functions, data

---

# <img src="figures/toolbox.png" style="height: 50px"/> <span style='color:#6DA34D'>Libraries</span>, functions, data

- Libraries (or packages) are collections of functions (and datasets)
- ~21,000 libraries on [CRAN](https://cran.r-project.org/web/packages/)

```r
install.packages('tidyverse')
library(tidyverse)
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Functions perform operations in R

```r
some_numbers <- c(1,2,4)

# min(), mean(), max(), and sd() are functions. c() is actually a function too!
min(some_numbers)
mean(some_numbers)
max(some_numbers)
sd(some_numbers)
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Functions perform operations in R

``` r
some_numbers <- c(1,2,4)

# min(), mean(), max(), and sd() are functions. c() is actually a function too!
min(some_numbers)
## [1] 1
```

``` r
mean(some_numbers)
## [1] 2.333333
```

``` r
max(some_numbers)
## [1] 4
```

``` r
sd(some_numbers)
## [1] 1.527525
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Function arguments
  - Let's look at arguments in the `lm` and `min` functions
  - Arguments are used to specify the data and operation
  - Arguments are comma separated

```r
help(lm)
help(min)
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Function arguments
  - Let's look at arguments in the `lm` and `min` functions
  - Arguments are used to specify the data and operation
  - Arguments are comma separated

```r
some_numbers <- c(1, 2, NA, 4)

min(some_numbers)
mean(some_numbers)
max(some_numbers)
sd(some_numbers)
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Function arguments
  - Let's look at arguments in the `lm` and `min` functions
  - Arguments are used to specify the data and operation
  - Arguments are comma separated

``` r
some_numbers <- c(1,2,NA, 4)

min(some_numbers)
## [1] NA
```

``` r
mean(some_numbers)
## [1] NA
```

``` r
max(some_numbers)
## [1] NA
```

``` r
sd(some_numbers)
## [1] NA
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Function arguments
  - Let's look at arguments in the `lm` and `min` functions
  - Arguments are used to specify the data and operation
  - Arguments are comma separated

``` r
some_numbers <- c(1,2,NA, 4)

min(some_numbers, na.rm = TRUE)
## [1] 1
```

``` r
mean(some_numbers, na.rm = TRUE)
## [1] 2.333333
```

``` r
max(some_numbers, na.rm = TRUE)
## [1] 4
```

``` r
sd(some_numbers, na.rm = TRUE)
## [1] 1.527525
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

- Anatomy of a function

```r
compute_mean <- function(x) {
  # Calculate the sum of all values in `x`, ignoring NA values
  x_sum <- sum(x, na.rm = TRUE)

  # We'll learn more about the ! operator later
  # Just know that the code below used to create `x_clean` removes NA values
  x_clean <- x[!is.na(x)]

  # Count the number of non-missing values in `x`
  x_len <- length(x_clean)

  # Divide `x_sum` by `x_len`
  x_mean <- x_sum / x_len

  # `return` is used to tell R the output of the function
  return(x_mean)
}
```

---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

**Exercise - 5 minutes**

- Ask [ChatGPT](https://chat.openai.com/) or [Perplexity](https://www.perplexity.ai/) to write a function that sums two numbers
- Copy and paste the function into your R script
- Run the code in the console to create the function
- Try to use the function!
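---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

One way to check a hand-written function is to compare it with a built-in one. The sketch below repeats `compute_mean()` from the anatomy slide in a shortened form and compares it with `mean(..., na.rm = TRUE)`. The `all.equal()` check is an addition for illustration, not part of the original slides.

```r
# Shortened copy of compute_mean() from the "Anatomy of a function" slide
compute_mean <- function(x) {
  x_sum <- sum(x, na.rm = TRUE)   # sum of the values, ignoring NA
  x_clean <- x[!is.na(x)]         # drop NA values
  x_len <- length(x_clean)        # count the values that remain
  return(x_sum / x_len)
}

some_numbers <- c(1, 2, NA, 4)

# Both calls should report the same mean
compute_mean(some_numbers)
mean(some_numbers, na.rm = TRUE)

# all.equal() compares numbers with a small tolerance
all.equal(compute_mean(some_numbers), mean(some_numbers, na.rm = TRUE))
```

If the two results ever disagree, the function body is the first place to look.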
---

# <img src="figures/tools.png" style="height: 50px"/> Libraries, <span style='color:#6DA34D'>functions</span>, data

**Exercise - 5 minutes**

- Ask [ChatGPT](https://chat.openai.com/) or [Perplexity](https://www.perplexity.ai/) to write a function that sums two numbers
- Copy and paste the function into your R script
- Run the code in the console to create the function
- Try to use the function!

```r
# A function called sum_two_numbers that takes two arguments.
# Argument 'x' represents the first number to be added.
# Argument 'y' represents the second number to be added.
sum_two_numbers <- function(x, y) {
  # Calculate the sum of x and y and store it in the variable 'result'
  result <- x + y

  # Return the result as the output of the function
  return(result)
}
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data type is an important concept in all programming languages
  - Creates a set of rules for inputs and outputs
- Data types
  - Vectors, tibbles, data frames
    - `str_subset()` subsets a character vector (it keeps the elements that match a pattern)
    - `filter()` subsets a tibble or data frame (it keeps the rows that meet a condition)
  - Integers, numbers, characters, factors
    - You can sum integers and numbers, but not character strings

--

Confirm data type

```r
class(c('Washington', 'Oregon', 'Idaho'))
is.character(c('Washington', 'Oregon', 'Idaho'))
is.factor(c('Washington', 'Oregon', 'Idaho'))
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data type is an important concept in all programming languages
  - Creates a set of rules for inputs and outputs
- Data types
  - Vectors, tibbles, data frames
    - `str_subset()` subsets a character vector (it keeps the elements that match a pattern)
    - `filter()` subsets a tibble or data frame (it keeps the rows that meet a condition)
  - Integers, numbers, characters, factors
    - You can sum integers and numbers, but not character strings

Confirm data type

``` r
class(c('Washington', 'Oregon', 'Idaho'))
## [1] "character"
```

``` r
is.character(c('Washington', 'Oregon', 'Idaho'))
## [1] TRUE
```

``` r
is.factor(c('Washington', 'Oregon', 'Idaho'))
## [1] FALSE
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data type is an important concept in all programming languages
  - Creates a set of rules for inputs and outputs
- Data types
  - Vectors, tibbles, data frames
    - `str_subset()` subsets a character vector (it keeps the elements that match a pattern)
    - `filter()` subsets a tibble or data frame (it keeps the rows that meet a condition)
  - Integers, numbers, characters, factors
    - You can sum integers and numbers, but not character strings

Confirm data type

```r
class(c('Washington', 'Oregon', 'Idaho'))
is.character(c('Washington', 'Oregon', 'Idaho'))
is.factor(c('Washington', 'Oregon', 'Idaho'))
```

--

Coerce data type

```r
as.factor(c('Washington', 'Oregon', 'Idaho'))
class(as.factor(c('Washington', 'Oregon', 'Idaho')))
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data type is an important concept in all programming languages
  - Creates a set of rules for inputs and outputs
- Data types
  - Vectors, tibbles, data frames
    - `str_subset()` subsets a character vector (it keeps the elements that match a pattern)
    - `filter()` subsets a tibble or data frame (it keeps the rows that meet a condition)
  - Integers, numbers, characters, factors
    - You can sum integers and numbers, but not character strings

Coerce data type

``` r
as.factor(c('Washington', 'Oregon', 'Idaho'))
## [1] Washington Oregon     Idaho     
## Levels: Idaho Oregon Washington
```

``` r
class(as.factor(c('Washington', 'Oregon', 'Idaho')))
## [1] "factor"
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data types
  - **Vectors**, tibbles, data frames
    - Vectors contain one or more values of the same type
    - Call a specific element by its location in a vector

``` r
c('Washington', 'Oregon', 'Idaho')
## [1] "Washington" "Oregon"     "Idaho"
```

``` r
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
```

``` r
rep(1:2, times = 2)
## [1] 1 2 1 2
```

``` r
seq(from = 0, to = 100, by = 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data types
  - **Vectors**, tibbles, data frames
    - Vectors contain one or more values of the same type
    - Call a specific element by its location in a vector

```r
c('Washington', 'Oregon', 'Idaho')[2]
seq(from = 0, to = 100, by = 10)[6]
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data types
  - **Vectors**, tibbles, data frames
    - Vectors contain one or more values of the same type
    - Call a specific element by its location in a vector

``` r
c('Washington', 'Oregon', 'Idaho')[2]
## [1] "Oregon"
```

``` r
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data types
  - Vectors, **tibbles**, data frames
    - Tibbles look a little like spreadsheets
    - Tibbles contain metadata on the dataset
      - Dimensions
      - Variable types
    - Printed tibbles truncate variable names and values

``` r
tibble(
  x = c(1:3)
  , y = c(4:6)
  , z = c('Washington', 'Oregon', 'Idaho')
)
## # A tibble: 3 × 3
##       x     y z         
##   <int> <int> <chr>     
## 1     1     4 Washington
## 2     2     5 Oregon    
## 3     3     6 Idaho
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Call a specific variable by name or location in the tibble
  - Variables are vectors
  - Use the `$` between the tibble name and the variable name
  - Also able to call the column (or row) by index number

```r
starwars$name
starwars[,1]
```

```
## [1] "Luke Skywalker"     "C-3PO"              "R2-D2"
## [4] "Darth Vader"        "Leia Organa"        "Owen Lars"
## [7] "Beru Whitesun Lars"
```

---

# <img src="figures/data.png" style="height: 50px"/> Libraries, functions, <span style='color:#6DA34D'>data</span>

- Data types
  - Vectors, tibbles, **data frames**
    - Data frames are like tibbles minus metadata (and truncated print)

``` r
data.frame(
  x = c(1:3)
  , y = c(4:6)
  , z = c('Washington', 'Oregon', 'Idaho')
)
##   x y          z
## 1 1 4 Washington
## 2 2 5     Oregon
## 3 3 6      Idaho
```

---
class: inverse, middle, center

# Let's apply what we learned to real data

---

# Let's apply what we learned to real data

You are an analyst at the City of Seattle. City leadership wants to develop new policy to regulate Airbnb properties. Over the next month, you will be asked to prepare findings that you observe in the [dataset compiled on Airbnb properties](https://www.kaggle.com/shanelev/seattle-airbnb-listings).

--

**Exercise - 5 minutes**

- What type of data object is `airbnb`?
- What is the lowest number of `reviews` for a property?
- What is the average number of `reviews` for a property?
- What is the range of number of `bathrooms` in the `airbnb` data?

--

To answer the questions, import the Seattle `airbnb` dataset.

```r
airbnb <- read_csv('https://bit.ly/3oadz2L')
```
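---

# Let's apply what we learned to real data

The exercise relies on pulling a column out of a dataset before passing it to a function. The sketch below shows the `$` and index approaches on a small, made-up data frame named `states` (a hypothetical object, not part of the course data). One difference to keep in mind: `[, 3]` returns a vector for a base data frame, while the same call on a tibble returns a one-column tibble.

```r
# A small, made-up data frame used only for illustration
states <- data.frame(
  x = 1:3,
  y = 4:6,
  z = c('Washington', 'Oregon', 'Idaho'),
  stringsAsFactors = FALSE   # already the default since R 4.0
)

states$z          # column by name, returned as a vector
states[, 3]       # column by position, a vector for a data frame
states[2, ]       # second row, returned as a one-row data frame
states$z[2]       # second element of column z
class(states$x)   # type of a single column
max(states$y)     # a column can be passed straight to a function
```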
---

# Let's apply what we learned to real data

**Exercise - 5 minutes**

- What type of data object is `airbnb`?
- What is the lowest number of `reviews` for a property?
- What is the average number of `reviews` for a property?
- What is the range of number of `bathrooms` in the `airbnb` data?

``` r
class(airbnb)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

--

``` r
min(airbnb$reviews)
## [1] 0
```

``` r
mean(airbnb$reviews)
## [1] 48.15021
```

--

``` r
min(airbnb$bathrooms, na.rm = TRUE)
## [1] 0
```

``` r
max(airbnb$bathrooms, na.rm = TRUE)
## [1] 8
```

---
class: inverse, middle, center

# Base R functions to explore tibbles

---

# Base R functions to explore tibbles

Get in the habit of running these functions when you open a new dataset

- `head()` shows you the top of a tibble (or data frame)
  - The number of rows shown (the `n` argument) defaults to 6
- `tail()` shows the bottom of a tibble
  - `arrange()` (from dplyr, not base R) pairs nicely with these subsetting functions to explore your data
- `summary()` shows summary statistics on all variables in a dataset
  - Reported summary statistics depend on the variable type
  - Min, max, mean, and median are reported for numeric variables
- `ls()` shows all variables in a tibble (sorted alphabetically)
  - You can also use `ls()` without calling an object between the parentheses to see all objects in your workspace
- `str()` shows the type and the first few values of every variable in a tibble
- `dim()` tells you the dimensions of your dataset
  - `nrow()` reports the number of rows only
  - `ncol()` reports the number of columns only

---

# Base R functions to explore tibbles

**Exercise 2 - 5 minutes**

- How many observations (or rows) are in `airbnb`?
- What is the median value for `rating`?
- How many variables are in `airbnb`?

<i>Hint: There are multiple ways to answer these questions with the functions you know</i>
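---

# Base R functions to explore tibbles

The exploration functions listed earlier can be tried on any dataset. As a minimal sketch, the code below runs them on `mtcars`, a data frame that ships with R. Using `mtcars` is an assumption made for illustration; it is not one of the course datasets.

```r
# mtcars is a built-in data frame, so no import is needed
head(mtcars, n = 3)    # first 3 rows instead of the default 6
tail(mtcars, n = 3)    # last 3 rows

dim(mtcars)            # rows and columns together
nrow(mtcars)           # rows only
ncol(mtcars)           # columns only

str(mtcars)            # type and first few values of every variable
summary(mtcars$mpg)    # summary statistics for one numeric variable
ls(mtcars)             # variable names, sorted alphabetically
```

Running `dim()` first is a quick way to confirm that a dataset loaded with the size you expect.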
---

# Base R functions to explore tibbles

**Exercise 2 - 5 minutes**

- How many observations (or rows) are in `airbnb`?
- What is the median value for `rating`?
- How many variables are in `airbnb`?

``` r
nrow(airbnb)
## [1] 7423
```

--

``` r
median(airbnb$rating, na.rm = TRUE)
## [1] 5
```

--

``` r
ncol(airbnb)
## [1] 14
```

--

Other methods to answer questions

```r
dim(airbnb)
summary(airbnb)
```

---

# Base R functions to explore vectors

- `table()` shows you the distribution of values in a vector
  - Indexed by name and position
- `length()` tells you the number of elements in a vector
- `unique()` shows you the unique values in a vector
- `sort()` orders values in a vector
- `summary()` shows you descriptive statistics on a vector
  - `summary()` can run on a vector or tibble

--

You can nest multiple functions

```r
length(unique(airbnb$room_type))
```

---

# Base R functions to explore vectors

**Exercise 3 - 5 minutes**

- How many distinct `address` values are there?
- Which is the third `host_id` value after sorting unique ids in ascending order?
- Is 'yurt' a `room_type` value?
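---

# Base R functions to explore vectors

The vector functions above, including indexing a `table()` result by name and by position, can be sketched on a small made-up vector. The `state_visits` vector is hypothetical and exists only for this illustration.

```r
# A made-up character vector with repeated values
state_visits <- c('Washington', 'Oregon', 'Idaho',
                  'Oregon', 'Washington', 'Washington')

counts <- table(state_visits)   # count how often each value appears
counts

counts['Washington']   # index the table by name
counts[1]              # index by position; names are sorted alphabetically

length(state_visits)           # number of elements
unique(state_visits)           # distinct values, in order of first appearance
sort(unique(state_visits))     # distinct values, sorted
length(unique(state_visits))   # nested call: number of distinct values
```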
---

# Base R functions to explore vectors

**Exercise 3 - 5 minutes**

- How many distinct `address` values are there?
- Which is the third `host_id` value after sorting unique ids in ascending order?
- Is 'yurt' a `room_type` value?

``` r
length(unique(airbnb$address))
## [1] 4
```

--

``` r
head(sort(unique(airbnb$host_id)), 3)
## [1] 2536 4193 4797
```

--

``` r
unique(airbnb$room_type)
## [1] "Entire home/apt" "Private room"    "Shared room"
```

---

# Troubleshooting your R code

Make sure...

- You loaded your libraries
- Arguments in your functions are comma-separated
- Your functions have opening and closing parentheses
- You have opening and closing quotation marks where relevant
- The variable type is correct to perform the function you call
  - For example, you probably don't want to run `min()` on a character string
- Your assignment operator looks like this: `<-`
- You're running code inside a code chunk (in a notebook)

--

Feel free to ask ChatGPT why you see an error!

```
Why does this code throw an error in R?

[Your code]
```

---

# Practice with climate data

By day you're an analyst at the City of Seattle; by night you're a climate analyst at the Earth Institute. Over the next four weeks, you will analyze [climate data](https://www.kaggle.com/sohelranaccselab/global-climate-change) to answer questions that help the Earth Institute understand temperature and temperature change in cities and countries around the world.

--

**Exercise 4 - 10 minutes**

- What is the earliest `year` in the `climate` dataset?
- What is the range of `uncertainty` values?
- How many cities and countries are in the dataset?
- In how many rows in the dataset is Japan the `country` value?

--

<b>Begin the exercise by importing the climate dataset in RStudio</b>

```r
climate <- read_csv('https://bit.ly/3kKErEb')
```
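---

# Troubleshooting your R code

One item on the troubleshooting checklist is checking the variable type before calling a function. The sketch below shows why, using a made-up vector of numbers stored as text (`reviews_chr` is a hypothetical name). `min()` does not throw an error here; it compares the values as text, which can be harder to spot than an error.

```r
# Numbers stored as character strings (made-up values)
reviews_chr <- c('10', '9', '100')

class(reviews_chr)   # "character"
min(reviews_chr)     # "10": text is compared character by character

# Coerce to numeric before computing
reviews_num <- as.numeric(reviews_chr)

class(reviews_num)   # "numeric"
min(reviews_num)     # 9
```

Running `class()` on a column before summarizing it is a cheap way to catch this kind of mismatch.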
---

# Practice with climate data

**Exercise 4 - 10 minutes**

- What is the earliest `year` in the `climate` dataset?
- What is the range of `uncertainty` values?
- How many cities and countries are in the dataset?
- In how many rows in the dataset is Japan the `country` value?

``` r
min(climate$year)
## [1] 1743
```

--

``` r
min(climate$uncertainty, na.rm = TRUE)
## [1] 0.04
```

``` r
max(climate$uncertainty, na.rm = TRUE)
## [1] 14.037
```

--

``` r
length(unique(climate$city))
## [1] 100
```

``` r
length(unique(climate$country))
## [1] 49
```

---

# Practice with climate data

**Exercise 4 - 10 minutes**

- What is the earliest `year` in the `climate` dataset?
- What is the range of `uncertainty` values?
- How many cities and countries are in the dataset?
- In how many rows in the dataset is Japan the `country` value?

``` r
table(climate$country)['Japan']
## Japan 
##  4095
```

---

# More practice - MBA admissions

You can find a description of the synthetic MBA admissions data [here](https://www.kaggle.com/datasets/taweilo/mba-admission-dataset). The data were generated from the University of Pennsylvania Wharton Class of 2025.

--

**Exercise 5 - 8 minutes**

- What is the lowest `gpa` in the `mba` dataset?
- What is the average number of years of work experience (`work_exp`)?
- How many applicants are international students?
- In how many rows in the dataset is Consulting the `work_industry` value?

--

<b>Begin the exercise by importing the MBA admissions dataset in RStudio</b>

```r
mba <- read_csv("https://bit.ly/4evYpeW")
```
---

# More practice - MBA admissions

**Exercise 5 - 8 minutes**

- What is the lowest `gpa` in the `mba` dataset?
- What is the average number of years of work experience (`work_exp`)?
- How many applicants are international students?
- In how many rows in the dataset is Consulting the `work_industry` value?

``` r
min(mba$gpa)
## [1] 2.65
```

--

``` r
mean(mba$work_exp, na.rm = TRUE)
## [1] 5.016952
```

--

``` r
table(mba$international)
## 
## FALSE  TRUE 
##  4352  1842
```

---

# More practice - MBA admissions

**Exercise 5 - 8 minutes**

- What is the lowest `gpa` in the `mba` dataset?
- What is the average number of years of work experience (`work_exp`)?
- How many applicants are international students?
- In how many rows in the dataset is Consulting the `work_industry` value?

``` r
table(mba$work_industry)['Consulting']
## Consulting 
##       1619
```
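---

# More practice - MBA admissions

Counting rows that match a value can be done in more than one way. As a minimal sketch on made-up vectors (`industry` and `international` are hypothetical stand-ins for the `mba` columns), the code below compares indexing a `table()` by name with summing a logical comparison. The `==` comparison has not been introduced in these slides, so treat it as a preview.

```r
# Made-up stand-ins for mba$work_industry and mba$international
industry <- c('Consulting', 'Technology', 'Consulting', 'Energy', 'Consulting')
international <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Approach used in the solution: build the full table, then index by name
table(industry)['Consulting']

# Alternative: TRUE counts as 1 and FALSE as 0 when summed
sum(industry == 'Consulting')

# The same idea counts TRUE values in a logical vector
sum(international)
```

Both approaches return the same count; `table()` is handy when you want every category at once.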